AI Scraping and the Open Web

Communications of the ACM

Tussles between websites and scrapers are not new. Almost since there has been a web to scrape, people have been scraping it and using the data to make search engines, caches and archives, analytics platforms, research datasets, and more. And for almost as long, some websites have objected and tried to stop the scraping with a mix of technical and legal measures. Broadly speaking, scrapers cause two kinds of problems for websites. First, they create bad traffic: millions of automated requests that no human will ever see.


Major Sites Are Saying No to Apple's AI Scraping

WIRED

Less than three months after Apple quietly debuted a tool for publishers to opt out of its AI training, a number of prominent news outlets and social platforms have taken the company up on it. WIRED can confirm that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and WIRED's parent company, Condé Nast, are among the many organizations opting to exclude their data from Apple's AI training. The cold reception reflects a significant shift in both the perception and use of the robotic crawlers that have trawled the web for decades. Now that these bots play a key role in collecting AI training data, they've become a conflict zone over intellectual property and the future of the web. This new tool, Applebot-Extended, is an extension to Apple's web-crawling bot that specifically lets website owners tell Apple not to use their data for AI training.
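The excerpt does not say how a publisher actually opts out. As a minimal sketch, assuming the opt-out is expressed as an ordinary robots.txt rule addressed to the Applebot-Extended user-agent token, the check below uses Python's standard `urllib.robotparser` to show how such a rule would be read. The robots.txt contents, the `is_allowed` helper, and the example.com URL are hypothetical and for illustration only; they are not taken from any of the sites named above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (an assumption, not a real
# publisher's file): refuse the Applebot-Extended token everywhere while
# leaving all other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # placeholder URL
    for agent in ("Applebot-Extended", "Applebot", "SomeOtherBot"):
        print(f"{agent}: allowed={is_allowed(ROBOTS_TXT, agent, url)}")
```

Under these assumptions, only the Applebot-Extended token is refused, while the regular Applebot crawler and other bots fall through to the wildcard rule. That matches the excerpt's description of a control aimed specifically at AI training rather than at crawling in general.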